feat: Docker deployment improvements, async fixes, and resilience#928
feat: Docker deployment improvements, async fixes, and resilience#928steveant wants to merge 5 commits intocoleam00:mainfrom
Conversation
📝 WalkthroughWalkthroughReplaces React Router usage with v7 experimental flags; adds stale-credential cleanup script and makes migration SQL idempotent; converts MCP server init to async with serialized lock; adds cached, timeout-bounded DB schema health checks; adds retries for credential loading; introduces embedding-schema detection, upsert storage and fallback; makes a code-summary function async; expands Supabase URL local validation; minor version bumps and changelog/README updates. Changes
Sequence Diagram(s)sequenceDiagram
participant App as DocumentStorageService
participant EmbedSupport as EmbeddingSchemaSupport
participant DB as SupabaseDB
participant RetryLogic as Retry/Fallback
App->>EmbedSupport: determine_embedding_column(dimension)
EmbedSupport-->>App: chosen column (per-dim or legacy)
App->>DB: upsert(batch_data with chosen column)
DB-->>App: error (missing per-dim column)
App->>RetryLogic: should_retry_with_legacy_column(error, records)
RetryLogic->>EmbedSupport: _convert_records_to_legacy(records)
EmbedSupport-->>RetryLogic: records converted (in-place)
RetryLogic-->>App: retry = true
App->>DB: upsert(batch_data with legacy column)
DB-->>App: success
App->>EmbedSupport: note_multi_dim_success()
sequenceDiagram
participant Client as MCP Client
participant Server as MCP Server
participant Lock as Async Init Lock
participant Init as Services Init
Client->>Server: connection request
Server->>Server: check _shared_context
alt context exists
Server-->>Client: yield cached context
else
Server->>Lock: acquire async lock
Lock->>Init: initialize session manager & services
Init-->>Server: store _shared_context
Lock-->>Server: release lock
Server-->>Client: yield initialized context
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~55 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@python/src/server/services/credential_service.py`:
- Around line 156-173: The except blocks in credential_service.py currently log
errors without stack traces and leave an unreachable RuntimeError at the end;
update the two logger.error(...) calls inside the APIError and generic except
blocks to pass exc_info=True so the full traceback is preserved, and remove the
final raise RuntimeError("Failed to load credentials after retries") since
control either returns, continues, or re-raises inside the loop; keep existing
re-raise behavior for the exceptions (do not swallow them).
In `@python/src/server/services/storage/document_storage_service.py`:
- Around line 475-478: The retry loop currently uses continue when
should_retry_with_legacy_column(e, batch_data) mutates batch_data, which can
skip the actual retry if that happens on the final iteration; change the logic
so that when should_retry_with_legacy_column(...) returns True you set a boolean
flag (e.g., force_one_more_retry) and, if on the last configured retry
iteration, arrange for one additional loop iteration (for example by checking
force_one_more_retry in the loop condition or incrementing the retry limit) so
the mutated batch_data is actually retried; ensure the flag is cleared after the
forced retry so normal retry behavior resumes.
🧹 Nitpick comments (7)
python/src/mcp_server/mcp_server.py (1)
150-198: Solid async initialization pattern with one minor logging improvement.The structure is well-designed:
- Local
ctxvariable avoids race conditions between lock release and yield- Yielding outside the lock enables concurrent connections
- Double-check pattern prevents redundant initialization
Per coding guidelines, prefer
exc_info=Trueover separatetraceback.format_exc()call to keep the stack trace in the same log entry.♻️ Suggested logging improvement
except Exception as e: - logger.error(f"💥 Critical error in lifespan setup: {e}") - logger.error(traceback.format_exc()) + logger.error(f"💥 Critical error in lifespan setup: {e}", exc_info=True) raisemigration/cleanup_stale_credentials.py (3)
42-52: Simplify exception handling.The broad
Exceptioncatch on line 51 is redundant sinceInvalidTokenandValueErroralready cover the specific failure cases. Consider removing it or logging the unexpected exception type for debugging.♻️ Suggested simplification
def try_decrypt(encrypted_value: str, fernet: Fernet) -> bool: """Attempt to decrypt a value, return True if successful.""" if not encrypted_value: return True # Empty values are fine try: encrypted_bytes = base64.urlsafe_b64decode(encrypted_value.encode("utf-8")) fernet.decrypt(encrypted_bytes) return True - except (InvalidToken, ValueError, Exception): + except (InvalidToken, ValueError): return False
122-133: Consider adding a dry-run mode for safer operation.The script permanently deletes credentials that can't be decrypted. While logging provides an audit trail, a
--dry-runflag would allow operators to preview what would be deleted before executing, reducing risk of accidental data loss.
55-55: Add return type annotation.The function returns
intbut lacks a type annotation. Per coding guidelines for Python backend, type hints should be used.♻️ Suggested fix
-def main(): +def main() -> int:python/src/server/main.py (1)
395-398: Consider caching unknown errors briefly.Unknown errors return
{"valid": False}without caching, which could lead to repeated database queries if the error persists. Consider caching these briefly (e.g., 5-10 seconds) to prevent query storms during database issues.python/src/server/services/storage/embedding_schema_support.py (2)
40-47: Consider simplifying the fallback logic for clarity.The nested
.get()on line 47 is correct but slightly obscure. Since1536is guaranteed to exist in_DIMENSION_TO_COLUMN, the inner fallback toLEGACY_COLUMNis unreachable.🔧 Optional simplification
- return _DIMENSION_TO_COLUMN.get(dimension, _DIMENSION_TO_COLUMN.get(1536, LEGACY_COLUMN)) + return _DIMENSION_TO_COLUMN.get(dimension, "embedding_1536")
87-97: Potential data loss if a record contains multiple dimension columns.If a record somehow has more than one dimension column (e.g., both
embedding_768andembedding_1536), the current logic pops all of them but only preserves the first one viasetdefault. Subsequent embeddings are silently discarded.In practice, records should only have one embedding column, but adding a
breakafter conversion would make the intent explicit and prevent accidental data loss.🔧 Suggested defensive improvement
for record in records: for column in DIMENSION_COLUMNS: if column in record: # Do not clobber an existing legacy value if the caller already set it. record.setdefault(LEGACY_COLUMN, record.pop(column)) converted = True + break # Only one dimension column expected per record return converted
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
archon-ui-main/src/App.tsxmigration/cleanup_stale_credentials.pymigration/complete_setup.sqlpython/src/mcp_server/mcp_server.pypython/src/server/api_routes/knowledge_api.pypython/src/server/config/config.pypython/src/server/main.pypython/src/server/services/credential_service.pypython/src/server/services/storage/code_storage_service.pypython/src/server/services/storage/document_storage_service.pypython/src/server/services/storage/embedding_schema_support.py
🧰 Additional context used
📓 Path-based instructions (8)
python/src/server/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/**/*.py: Fail fast and loud for service startup failures, missing configuration, database connection failures, authentication/authorization failures, data corruption, critical dependency unavailability, and invalid data that would corrupt state
Never accept corrupted data - skip failed items entirely rather than storing corrupted or incomplete data
Files:
python/src/server/services/credential_service.pypython/src/server/services/storage/document_storage_service.pypython/src/server/config/config.pypython/src/server/services/storage/code_storage_service.pypython/src/server/main.pypython/src/server/services/storage/embedding_schema_support.pypython/src/server/api_routes/knowledge_api.py
python/src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/**/*.py: Complete batch processing, background tasks, and external API calls but log detailed errors for each failure instead of crashing
Include context about what was being attempted, preserve full stack traces with exc_info=True in Python logging, use specific exception types, include relevant IDs/URLs, never return None to indicate failure, and report both success count and detailed failure list for batch operations
Use Ruff for linting and MyPy for type checking with 120 character line length in Python backend
Use Python 3.12 with 120 character line length in backend code
Import GITHUB_REPO_OWNER and GITHUB_REPO_NAME constants from python/src/server/config/version.py as the single source of truth for repository configuration
python/src/**/*.py: Python backend code should use 120 character line length
Use Ruff for linting Python code to check for errors, warnings, and unused imports
Use Mypy for type checking in Python to ensure type safety
Files:
python/src/server/services/credential_service.pypython/src/server/services/storage/document_storage_service.pypython/src/server/config/config.pypython/src/server/services/storage/code_storage_service.pypython/src/server/main.pypython/src/mcp_server/mcp_server.pypython/src/server/services/storage/embedding_schema_support.pypython/src/server/api_routes/knowledge_api.py
python/src/server/services/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
python/src/server/services/**/*.py: For batch processing and background tasks, complete what you can and report detailed failures for each item rather than crashing
Never accept corrupted data - skip failed items entirely rather than storing corrupted data in batch operations
Files:
python/src/server/services/credential_service.pypython/src/server/services/storage/document_storage_service.pypython/src/server/services/storage/code_storage_service.pypython/src/server/services/storage/embedding_schema_support.py
python/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Error messages must include context about what was being attempted, preserve full stack traces, use specific exception types, include relevant IDs/URLs, and report both success count and detailed failure list for batch operations
Files:
python/src/server/services/credential_service.pypython/src/server/services/storage/document_storage_service.pypython/src/server/config/config.pypython/src/server/services/storage/code_storage_service.pypython/src/server/main.pypython/src/mcp_server/mcp_server.pypython/src/server/services/storage/embedding_schema_support.pypython/src/server/api_routes/knowledge_api.py
python/src/server/main.py
📄 CodeRabbit inference engine (AGENTS.md)
Service startup failures, missing configuration, database connection failures, and authentication/authorization failures should crash with clear errors
Files:
python/src/server/main.py
python/src/server/api_routes/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
python/src/server/api_routes/**/*.py: Follow the API route → Service → Database pattern for new API endpoints
Use database values directly without frontend mapping to maintain type-safe end-to-end communication from backend upward
Files:
python/src/server/api_routes/knowledge_api.py
archon-ui-main/src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
archon-ui-main/src/**/*.{ts,tsx}: Use TypeScript in strict mode with no implicit any in frontend code
Do not use dynamic Tailwind class construction in frontend components
Files:
archon-ui-main/src/App.tsx
archon-ui-main/src/**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
Use ESLint for legacy frontend code outside /src/features directory with standard React rules
Files:
archon-ui-main/src/App.tsx
🧠 Learnings (13)
📚 Learning: 2025-08-20T19:38:04.097Z
Learnt from: Chillbruhhh
Repo: coleam00/Archon PR: 378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.097Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly without requiring additional migrations.
Applied to files:
python/src/server/services/storage/document_storage_service.pymigration/complete_setup.sql
📚 Learning: 2025-08-20T19:38:04.097Z
Learnt from: Chillbruhhh
Repo: coleam00/Archon PR: 378
File: python/src/server/services/storage/document_storage_service.py:304-306
Timestamp: 2025-08-20T19:38:04.097Z
Learning: The archon_crawled_pages table in the Archon project has a table-level unique constraint on (url, chunk_number) defined inline in the CREATE TABLE statement in migration/complete_setup.sql at line 202, which allows upsert operations with on_conflict="url,chunk_number" to work properly.
Applied to files:
python/src/server/services/storage/document_storage_service.pymigration/complete_setup.sql
📚 Learning: 2025-11-29T19:24:46.265Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: archon-example-workflow/CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:46.265Z
Learning: Check if Archon MCP server is available before doing anything else in task management scenarios
Applied to files:
python/src/mcp_server/mcp_server.py
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to archon-ui-main/src/!(features)/**/*.{ts,tsx} : Use ESLint with standard React rules for legacy code in the frontend
Applied to files:
archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to archon-ui-main/src/features/**/components/**/*.{ts,tsx} : Use Radix UI primitives from src/features/ui/primitives/ for new UI components in features directory
Applied to files:
archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to archon-ui-main/src/**/*.{ts,tsx,js,jsx} : Use ESLint for legacy frontend code outside /src/features directory with standard React rules
Applied to files:
archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to archon-ui-main/src/features/**/*.{ts,tsx} : TypeScript strict mode enabled with no implicit any, and use Biome with 120 character lines and double quotes in /src/features directory
Applied to files:
archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to archon-ui-main/src/features/**/components/**/*.{ts,tsx} : Prefer using Radix UI primitives from src/features/ui/primitives/ when creating UI components
Applied to files:
archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to archon-ui-main/src/features/**/*.{ts,tsx} : Apply Tron-inspired glassmorphism styling with Tailwind and follow responsive design patterns with mobile-first approach
Applied to files:
archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to archon-ui-main/src/features/**/components/**/*.{ts,tsx} : Apply Tron-inspired glassmorphism styling with Tailwind in features directory UI components
Applied to files:
archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:25:23.987Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-11-29T19:25:23.987Z
Learning: Applies to archon-ui-main/**/*.test.{ts,tsx} : Use Vitest with React Testing Library for frontend testing
Applied to files:
archon-ui-main/src/App.tsx
📚 Learning: 2025-09-04T16:30:05.227Z
Learnt from: stevepresley
Repo: coleam00/Archon PR: 573
File: archon-ui-main/src/config/api.ts:15-25
Timestamp: 2025-09-04T16:30:05.227Z
Learning: Archon UI API config: Prefer lazy getters getApiFullUrl() and getWsUrl() over module-load constants to avoid SSR/test crashes. Avoid CommonJS exports patterns (Object.defineProperty(exports,…)) in ESM. Add typeof window guards with VITE_API_URL fallback inside getApiUrl()/getWebSocketUrl() when SSR safety is required.
Applied to files:
archon-ui-main/src/App.tsx
📚 Learning: 2025-11-29T19:24:36.223Z
Learnt from: CR
Repo: coleam00/Archon PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-11-29T19:24:36.223Z
Learning: Applies to archon-ui-main/src/**/*.{test,spec}.{ts,tsx} : Use Vitest with React Testing Library for frontend testing
Applied to files:
archon-ui-main/src/App.tsx
🧬 Code graph analysis (4)
python/src/server/services/credential_service.py (1)
python/src/server/utils/progress/progress_tracker.py (1)
error(196-219)
python/src/server/services/storage/document_storage_service.py (1)
python/src/server/services/storage/embedding_schema_support.py (4)
determine_embedding_column(40-47)legacy_column_in_use(100-103)note_multi_dim_success(50-55)should_retry_with_legacy_column(58-79)
python/src/mcp_server/mcp_server.py (1)
python/src/server/utils/progress/progress_tracker.py (1)
error(196-219)
python/src/server/api_routes/knowledge_api.py (1)
python/src/server/main.py (2)
main(404-423)_check_database_schema_cached(297-336)
🔇 Additional comments (22)
python/src/server/config/config.py (1)
139-146: LGTM! Secure patterns for Docker Compose compatibility.The hostname allowlisting is secure:
.localhostis a reserved TLD (RFC 6761) that cannot be registered publicly._supabasesuffix is safe because underscores are invalid in DNS hostnames (RFC 1123), meaning only Docker's internal DNS can resolve such names—no subdomain bypass risk from public domains.python/src/mcp_server/mcp_server.py (3)
17-17: LGTM!Necessary import for the
asyncio.Lockused in the initialization logic.
67-72: Well-documented lock selection.The comment clearly explains the tradeoff:
asyncio.Lockis appropriate for single Uvicorn worker and avoids blocking the event loop, while noting the constraint for multi-worker deployments. This is the correct choice for FastMCP's intended deployment model.
144-148: LGTM!The quick-path optimization is correct. The explicit
is not Nonecheck and the double-check pattern inside the lock (line 156) ensure thread-safety in the async context.python/src/server/services/credential_service.py (2)
8-8: LGTM!The imports for
asyncioandAPIErrorare correctly added to support the new retry logic with exponential backoff.Also applies to: 21-21
138-154: LGTM!The success path correctly processes credentials and updates the cache. The fail-fast behavior on corrupted data (e.g., missing
keyfield) is appropriate for service startup per coding guidelines. The logging of loaded credential count provides good observability.archon-ui-main/src/App.tsx (1)
105-105: LGTM! Good forward-compatibility preparation.Adding the React Router v7 future flags (
v7_startTransition,v7_relativeSplatPath) is a best practice for incremental migration. This enables the new behaviors early, allowing you to catch any issues before the v7 release.migration/complete_setup.sql (2)
91-107: LGTM! Correct idempotent INSERT pattern.Adding
ON CONFLICT (key) DO NOTHINGto seed data INSERTs ensures migrations can be safely re-run without duplicate key errors. This appropriately preserves user-customized values while allowing initial seeding.
51-70: Good idempotent pattern, with potential simplification for PostgreSQL 14+.The trigger existence check with DROP/CREATE pattern ensures safe re-runs. For PostgreSQL 14+, you could simplify this using
CREATE OR REPLACE TRIGGERalone, as it already handles the "replace if exists" case. The current approach is more verbose but works on older PostgreSQL versions.python/src/server/main.py (2)
219-260: LGTM! Well-designed non-blocking health check.The timeout-wrapped schema check prevents health endpoint timeouts when the database is slow. The fallback to "assume valid" on timeout/error is appropriate for health checks - it keeps the service running while logging the issue. The nested timeout structure (2s outer, 1.5s inner) provides appropriate headroom.
297-336: LGTM! Smart caching strategy.The caching approach is well-designed:
- Permanently cache valid schema (common case, no need to re-check)
- Cache failures for 30 seconds (allow recovery without hammering DB)
- Don't cache timeouts (allow immediate retry)
This minimizes database load while ensuring the health check remains responsive.
python/src/server/api_routes/knowledge_api.py (2)
1037-1055: LGTM! Good extensibility improvement.Adding
**extra_fieldsto the progress callback allows richer contextual data to flow through without changing the signature again. The PEP 604 type hint (dict | None) is appropriate for Python 3.12.
1263-1265: LGTM! Correctly updated to use async cached schema check.The change to use
_check_database_schema_cachedwithawaitis correct and maintains consistency with the main health endpoint. Both endpoints now benefit from the non-blocking, cached schema validation.migration/cleanup_stale_credentials.py (1)
28-39: Encryption parameters are correctly synchronized withcredential_service.py. All values match exactly: salt, iterations, and default key are consistent between both files.python/src/server/services/storage/embedding_schema_support.py (3)
1-28: LGTM! Module structure and caching approach are sound.The tri-state cache (
None→ undetermined,True→ multi-dim supported,False→ legacy only) elegantly handles runtime schema detection without repeated database introspection.
58-79: LGTM! Error-driven fallback pattern is well-designed.The function correctly detects PostgREST column-missing errors and transparently migrates records to the legacy schema. The in-place mutation is clearly documented and avoids unnecessary copying.
82-84: LGTM!Helper functions are minimal and correct.
Also applies to: 100-103
python/src/server/services/storage/document_storage_service.py (4)
14-19: LGTM! Clean import of schema support utilities.The new imports consolidate embedding column logic into a dedicated module, improving maintainability.
394-401: LGTM! Good fallback with appropriate warning.Using
determine_embedding_columncentralizes the logic, and the warning correctly alerts when an unexpected dimension falls back to a default column.
442-447: LGTM! Upsert prevents duplicate key errors.The switch from
inserttoupsertwithon_conflict="url,chunk_number"aligns with the table's unique constraint and prevents failures during re-crawls. Based on learnings, thearchon_crawled_pagestable has this constraint defined.
507-513: LGTM! Individual record fallback uses consistent upsert semantics.The per-record fallback path now matches the batch upsert behavior, ensuring duplicate-key safety even when batch operations fail.
python/src/server/services/storage/code_storage_service.py (1)
576-593: Good async correction. No external callers found.Converting to native async is the right fix—calling
asyncio.run()from within an already-running event loop would raiseRuntimeError. A comprehensive search of the codebase confirmsgenerate_code_example_summaryhas no external callers; the only internal usage (line 1065) correctly awaits the function. The async conversion is safe and poses no breaking changes to the current codebase.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| except APIError as api_error: | ||
| if api_error.code == "PGRST205" and attempt < retries: | ||
| logger.warning( | ||
| "Supabase schema cache not ready (attempt %s/%s); retrying in %ss", | ||
| attempt, | ||
| retries, | ||
| delay, | ||
| ) | ||
| await asyncio.sleep(delay) | ||
| delay = min(delay * 2, 10) | ||
| continue | ||
| logger.error(f"Error loading credentials: {api_error}") | ||
| raise | ||
| except Exception as e: | ||
| logger.error(f"Error loading credentials: {e}") | ||
| raise | ||
|
|
||
| except Exception as e: | ||
| logger.error(f"Error loading credentials: {e}") | ||
| raise | ||
| raise RuntimeError("Failed to load credentials after retries") |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n python/src/server/services/credential_service.py | sed -n '130,180p'Repository: coleam00/Archon
Length of output: 2484
🏁 Script executed:
fd credential_service.py | head -5Repository: coleam00/Archon
Length of output: 153
🏁 Script executed:
head -50 python/src/server/services/credential_service.pyRepository: coleam00/Archon
Length of output: 1362
🌐 Web query:
postgrest-py APIError code attribute type
💡 Result:
Short answer: APIError.code is not a fixed Python type — it's whatever value PostgREST returns in the JSON error payload (so usually a string, but can be an int in some responses). The postgrest-py APIError is raised with the raw r.json() dict, so check the runtime type before using it. [1][2][3]
Sources
- postgrest-py docs (execute() raises APIError with the response JSON). [1]
- Issue/trace showing APIError containing SQLSTATE-style string code ('42804'). [2]
- Discussion showing APIError containing a numeric code (204) in some cases. [3]
Unreachable code and missing stack trace preservation in error logging.
This code has two issues per the coding guidelines:
-
Line 173 is unreachable: The loop always exits via
return(success on line 154),raise(APIError on final attempt or line 168, or other exceptions on line 171), orcontinue(retry). TheRuntimeErroron line 173 can never be executed and should be removed. -
Lines 167 and 170 missing
exc_info=True: Per coding guidelines, error logging must preserve full stack traces for debugging. Bothlogger.error()calls needexc_info=True.
Proposed fix
except APIError as api_error:
if api_error.code == "PGRST205" and attempt < retries:
logger.warning(
"Supabase schema cache not ready (attempt %s/%s); retrying in %ss",
attempt,
retries,
delay,
)
await asyncio.sleep(delay)
delay = min(delay * 2, 10)
continue
- logger.error(f"Error loading credentials: {api_error}")
+ logger.error(f"Error loading credentials after {attempt} attempts: {api_error}", exc_info=True)
raise
except Exception as e:
- logger.error(f"Error loading credentials: {e}")
+ logger.error(f"Error loading credentials: {e}", exc_info=True)
raise
-
- raise RuntimeError("Failed to load credentials after retries")🤖 Prompt for AI Agents
In `@python/src/server/services/credential_service.py` around lines 156 - 173, The
except blocks in credential_service.py currently log errors without stack traces
and leave an unreachable RuntimeError at the end; update the two
logger.error(...) calls inside the APIError and generic except blocks to pass
exc_info=True so the full traceback is preserved, and remove the final raise
RuntimeError("Failed to load credentials after retries") since control either
returns, continues, or re-raises inside the loop; keep existing re-raise
behavior for the exceptions (do not swallow them).
| except Exception as e: | ||
| if should_retry_with_legacy_column(e, batch_data): | ||
| retry_delay = 1.0 | ||
| continue |
There was a problem hiding this comment.
Bug: Legacy column fallback may skip the actual retry on the last attempt.
If should_retry_with_legacy_column returns True during the final retry iteration (retry=2), the continue statement exits the current iteration, but the for loop ends without executing another iteration. The records are converted to use the legacy column but never actually retried.
Example scenario:
retry=2(last attempt) → upsert fails with "column not found"should_retry_with_legacy_column(e, batch_data)returnsTrue, mutatesbatch_datacontinue→ loop ends (noretry=3exists)- Converted records are never persisted
🐛 Suggested fix using a flag to force one more retry
max_retries = 3
retry_delay = 1.0
+ force_legacy_retry = False
- for retry in range(max_retries):
+ retry = 0
+ while retry < max_retries or force_legacy_retry:
+ force_legacy_retry = False # Reset after consuming the forced retry
# Check for cancellation before each retry attempt
if cancellation_check:
try:
cancellation_check()
except asyncio.CancelledError:
# ... existing cancellation handling ...
raise
try:
client.table("archon_crawled_pages").upsert(
batch_data,
on_conflict="url,chunk_number"
).execute()
if not legacy_column_in_use():
note_multi_dim_success()
total_chunks_stored += len(batch_data)
# ... rest of success handling ...
break
except Exception as e:
if should_retry_with_legacy_column(e, batch_data):
retry_delay = 1.0
+ force_legacy_retry = True
+ retry += 1
continue
- if retry < max_retries - 1:
+ elif retry < max_retries - 1:
search_logger.warning(
f"Error inserting batch (attempt {retry + 1}/{max_retries}): {e}"
)
await asyncio.sleep(retry_delay)
retry_delay *= 2 # Exponential backoff
+ retry += 1
else:
# ... existing fallback to individual inserts ...
+ break🤖 Prompt for AI Agents
In `@python/src/server/services/storage/document_storage_service.py` around lines
475 - 478, The retry loop currently uses continue when
should_retry_with_legacy_column(e, batch_data) mutates batch_data, which can
skip the actual retry if that happens on the final iteration; change the logic
so that when should_retry_with_legacy_column(...) returns True you set a boolean
flag (e.g., force_one_more_retry) and, if on the last configured retry
iteration, arrange for one additional loop iteration (for example by checking
force_one_more_retry in the loop condition or incrementing the retry limit) so
the mutated batch_data is actually retried; ensure the flag is cleared after the
forced retry so normal retry behavior resumes.
This PR consolidates several improvements for Docker-based deployments: ## Non-blocking Health Check (main.py) - Health endpoint now uses asyncio.wait_for with 2s timeout - Prevents health checks from hanging when database is slow - Gracefully degrades with timeout/error flags instead of failing ## Docker Container URL Validation (config.py) - Allow hostnames ending with _supabase (e.g., supabase_kong_supabase) - Fixes URL validation for Docker Compose Supabase deployments - Maintains security by only allowing known container naming patterns ## MCP Server Async Improvements (mcp_server.py) - Changed threading.Lock to asyncio.Lock to avoid blocking event loop - Improved double-check pattern with proper null safety - Better documentation of thread safety considerations ## Credential Service Resilience (credential_service.py) - Added retry logic with exponential backoff for PGRST205 errors - Handles Supabase schema cache not ready during startup - Prevents startup failures when Supabase is still initializing ## Document Storage Improvements (document_storage_service.py) - Changed insert to upsert with on_conflict="url,chunk_number" - Prevents duplicate key errors when re-crawling same URLs - Added embedding schema fallback support for mixed deployments ## New Files - embedding_schema_support.py: Schema compatibility helpers for embedding columns - cleanup_stale_credentials.py: Utility to remove undecryptable credentials after SUPABASE_SERVICE_KEY rotation
## Additional Changes (from audit) ### knowledge_api.py - Extensible progress callback with **extra_fields parameter - Uses cached schema check function (_check_database_schema_cached) - PEP 604 type hints for better code quality ### code_storage_service.py - Converted generate_code_example_summary from sync to async - Eliminates asyncio.run() which can cause event loop conflicts - Function not called externally, safe breaking change ### complete_setup.sql - Idempotent trigger creation (DROP IF EXISTS before CREATE) - Added ON CONFLICT DO NOTHING to INSERT statements - Makes migrations safely re-runnable ### App.tsx - Added React Router v7 future flags (v7_startTransition, v7_relativeSplatPath) - Prepares codebase for Router v7 migration - Valid flags for React Router 6.26.2
8274043 to
29ac5e1
Compare
Rename `result_type` to `output_type` in DocumentAgent to fix compatibility with pydantic-ai 1.0.x. The Agent class renamed this parameter in the 1.0 release.
Servers sometimes return HTTP 200 with text/html for missing files instead of a proper 404 status. This caused the crawler to incorrectly detect llms.txt files that don't exist. Added content-type validation to _check_url_exists() that verifies: - .txt/.md files return text/plain or text/markdown (not text/html) - .xml files return text/xml or application/xml variants When a soft 404 is detected, the method now returns False and logs an informative message, allowing the crawler to fall back to HTML crawling correctly. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
|
🔄 This repository is being replaced by a new version of Archon. The original Python/MCP codebase is being archived to the This PR is being closed as part of the migration. Thank you for your contribution! |
…-587 fix(git,isolation): worktree lifecycle gaps — ENOENT distinction, corruption warning, orphan cleanup
…ix-issue-587 fix(git,isolation): worktree lifecycle gaps — ENOENT distinction, corruption warning, orphan cleanup
…ix-issue-587 fix(git,isolation): worktree lifecycle gaps — ENOENT distinction, corruption warning, orphan cleanup
Summary
Consolidation of battle-tested improvements from production Docker deployments. These changes address real-world issues encountered when running Archon in Docker Compose environments with self-hosted Supabase.
Changes (13 files)
Health & Resilience
python/src/server/main.py- Non-blocking health checkasyncio.wait_forwith 2s timeoutpython/src/server/services/credential_service.py- Startup resiliencepython/src/server/services/storage/document_storage_service.py- Idempotent upsertsinserttoupsertwithon_conflict="url,chunk_number"Async Correctness
python/src/mcp_server/mcp_server.py- Event loop safetythreading.Locktoasyncio.Lockto avoid blocking event looppython/src/server/services/storage/code_storage_service.py- Proper async functiongenerate_code_example_summaryfrom sync to asyncasyncio.run()which can cause "event loop already running" errorspython/src/server/api_routes/knowledge_api.py- Extensible callbacks**extra_fieldsfor extensibilitydict | Nonesyntax)Bug Fixes
python/src/agents/document_agent.py- Pydantic AI compatibilityresult_typetooutput_typefor pydantic-ai 1.0.x compatibilitypython/src/server/services/crawling/discovery_service.py- Soft 404 detection_check_url_exists()llms.txtand similar filesDocker Compatibility
python/src/server/config/config.py- Container hostname support_supabase(e.g.,supabase_kong_supabase)python/src/server/services/storage/embedding_schema_support.py- NEWMigrations & Maintenance
migration/complete_setup.sql- Idempotent migrationsON CONFLICT DO NOTHINGto INSERT statementsmigration/cleanup_stale_credentials.py- NEWSUPABASE_SERVICE_KEYrotationFrontend
archon-ui-main/src/App.tsx- React Router v7 preparationv7_startTransition,v7_relativeSplatPathTest Plan
sudo ./bootstrap-archon.sh)Breaking Changes
None. All changes are backwards compatible.
Notes
_supabasehostname pattern is specific to Docker Compose service naming conventionsSummary by CodeRabbit
New Features
Improvements
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.